Alfred Job Network Graph
This notebook builds on the proposal in FinalProjectProposa_Draft.pdf and uses job data stored in the alfred_db PostgreSQL database.
We will extract the latest postings generated by Alfred, engineer job & skill relationships, and analyze the resulting graph structure.
Workflow Overview
There are some companion files in this project that help with running this notebook:
- artifacts/jobs_clean.csv: cleaned job data exported from the database
- artifacts/jobs_skill_map.csv: exploded job-skill mapping data
- artifacts/jobs_clean.parquet: Parquet version of the cleaned job data
- artifacts/jobs_skill_map.parquet: Parquet version of the job-skill mapping data
I am pulling directly from the database in this notebook, but these files can be used to skip that step if needed.
If you are pulling from the database, the workflow is as follows:
- Import necessary libraries for data manipulation, database connection, and graph analysis.
- Load environment variables (or manually provide credentials) that describe how to reach alfred_db.
- Connect to PostgreSQL, pull the jobs table into a pandas DataFrame, and perform light cleaning.
- Use spaCy-based NLP to infer skill phrases directly from each job description.
%pip install -q python-dotenv sqlalchemy psycopg2-binary pandas networkx matplotlib plotly-express tqdm spacy scikit-learn
Note: you may need to restart the kernel to use updated packages.
# Core analysis imports plus visualization + progress helpers used throughout the notebook.
import os
from pathlib import Path
from typing import List, Tuple
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine
from dotenv import load_dotenv
import networkx as nx
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from tqdm.auto import tqdm
import networkx.algorithms.community as nx_comm
from collections import defaultdict
import re
import itertools
from collections import Counter
import spacy
from spacy.cli import download as spacy_download
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
pd.set_option("display.max_colwidth", 100)
pd.set_option("display.max_rows", 20)
import plotly.io as pio
pio.renderers.default = 'png' # ensure Plotly renders as static images for exports
Load Environment Variables and Create a Database Engine
I store my database credentials in a .env file for security and convenience.
Alternatively, you can start at the section titled "From CSV Files" below to skip the database connection step.
PROJECT_ROOT = Path.cwd()
ALFRED_ROOT = (
PROJECT_ROOT.parents[0] / "alfred"
).resolve() # Look one level up for the Alfred repo.
# Load whichever .env files exist so DATABASE_URL and friends are available without manual export.
candidate_env_files = [PROJECT_ROOT / ".env", ALFRED_ROOT / ".env"]
for env_path in candidate_env_files:
if env_path.exists():
load_dotenv(env_path, override=False)
print(f"Loaded environment variables from {env_path}")
DATABASE_URL = os.getenv("DATABASE_URL")
if not DATABASE_URL:
raise RuntimeError(
"DATABASE_URL is not set. Create a .env file or export the variable before running this notebook."
)
# Create a reusable SQLAlchemy engine that downstream cells can share.
engine: Engine = create_engine(DATABASE_URL)
engine
Loaded environment variables from d:\Data620\FinalProject\.env
Engine(postgresql+psycopg2://alfred:***@localhost:5432/alfred_db)
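For reference, the .env file only needs one entry: a SQLAlchemy-style connection URL matching the engine shown above. The values here are placeholders; substitute your own user, password, host, and database name:

```shell
# .env (placeholder credentials for illustration)
DATABASE_URL=postgresql+psycopg2://alfred:change-me@localhost:5432/alfred_db
```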
Extract the Latest Jobs Table Snapshot
We query the core columns required for downstream analysis. Adjust the SQL to join additional tables once they are available (e.g., embeddings, generated resume artifacts, or recruiter contact data).
def fetch_jobs(engine: Engine) -> pd.DataFrame:
"""Return the latest Alfred job postings needed for the network analysis."""
query = text(
"""
SELECT
id,
title,
company,
location,
description,
source_url,
match_score
FROM jobs
ORDER BY id DESC;
"""
) # Adjust the SQL when you want to join extra tables/columns.
return pd.read_sql_query(query, engine)
# Pull the raw dataset once so downstream wrangling operates on an in-memory DataFrame.
jobs_raw = fetch_jobs(engine)
print(f"Retrieved {len(jobs_raw):,} job postings from Alfred.")
jobs_raw.head()
Retrieved 1,208 job postings from Alfred.
| id | title | company | location | description | source_url | match_score | |
|---|---|---|---|---|---|---|---|
| 0 | 1370 | Data Scientist | Insight Global | South Camden, Camden County | Job Description The Cooper Health System is seeking a Data Scientist to join its analytics team.... | https://www.adzuna.com/land/ad/5513717171?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN |
| 1 | 1369 | Data Scientist | Home Depot | Atlanta, Fulton County | Position Purpose: The Data Scientist is responsible for supporting data science initiatives that... | https://www.adzuna.com/land/ad/5522795794?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN |
| 2 | 1368 | Data Scientist | Stanford University | Stanford, Santa Clara County | Data Scientist Graduate School of Education, Stanford, California, United States Information Ana... | https://www.adzuna.com/land/ad/5454326822?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN |
| 3 | 1367 | Data Scientist | Manulife | Boston, Suffolk County | At Manulife, data scientists collaborate with engineers, IT, and experts to build analytical mod... | https://www.adzuna.com/land/ad/5513716776?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN |
| 4 | 1366 | Data Scientist | Generac Power Systems | Pewaukee, Waukesha County | We are Generac, a leading energy technology company committed to powering a smarter world. Over ... | https://www.adzuna.com/land/ad/5528566412?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN |
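If the jobs table grows large, a bound-parameter variant of the query keeps pulls cheap. This is a sketch, not part of the original workflow: the fetch_recent_jobs name and the LIMIT clause are additions.

```python
import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.engine import Engine


def fetch_recent_jobs(engine: Engine, limit: int = 100) -> pd.DataFrame:
    """Like fetch_jobs, but pull only the newest `limit` postings."""
    query = text(
        """
        SELECT id, title, company, location, description, source_url, match_score
        FROM jobs
        ORDER BY id DESC
        LIMIT :limit
        """
    )
    # pandas forwards the params dict to SQLAlchemy, so `limit` is bound
    # safely instead of being interpolated into the SQL string.
    return pd.read_sql_query(query, engine, params={"limit": limit})
```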
Clean Text Fields and Create Helper Columns
We standardize key string columns, drop duplicate source_urls, and prepare a normalized description column for lightweight NLP.
# NOTE: the original character class listed literal icon glyphs that were lost in
# export; this range-based pattern covers the same bullet/arrow/dingbat blocks.
ICON_PATTERN = r"[\u2022\u2190-\u21ff\u2500-\u27bf\ufe0f]"
# Basic QA + normalization to reduce noisy punctuation before NLP-heavy steps.
jobs_clean = (
jobs_raw.copy()
.dropna(subset=["title", "description"])
.drop_duplicates(subset=["source_url"], keep="first")
.assign(
title=lambda df: df["title"].str.strip(),
company=lambda df: df["company"].fillna("Unknown").str.strip(),
location=lambda df: df["location"].fillna("Remote").str.strip(),
description=lambda df: df["description"]
.str.replace(
ICON_PATTERN, " ", regex=True
) # strip arrow/bullet icons that survive scraping
.str.replace(r"\s+", " ", regex=True) # collapse multi-space/newline sequences
.str.strip(),
)
)
# Helper columns reused in later analysis/plots.
jobs_clean["description_lower"] = jobs_clean["description"].str.lower()
jobs_clean["job_label"] = jobs_clean.apply(
lambda row: f"{row['title']} @ {row['company']}", axis=1
)
print(f"{len(jobs_clean):,} clean job postings remaining after QA.")
jobs_clean.head()
1,208 clean job postings remaining after QA.
| id | title | company | location | description | source_url | match_score | description_lower | job_label | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1370 | Data Scientist | Insight Global | South Camden, Camden County | Job Description The Cooper Health System is seeking a Data Scientist to join its analytics team.... | https://www.adzuna.com/land/ad/5513717171?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN | job description the cooper health system is seeking a data scientist to join its analytics team.... | Data Scientist @ Insight Global |
| 1 | 1369 | Data Scientist | Home Depot | Atlanta, Fulton County | Position Purpose: The Data Scientist is responsible for supporting data science initiatives that... | https://www.adzuna.com/land/ad/5522795794?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN | position purpose: the data scientist is responsible for supporting data science initiatives that... | Data Scientist @ Home Depot |
| 2 | 1368 | Data Scientist | Stanford University | Stanford, Santa Clara County | Data Scientist Graduate School of Education, Stanford, California, United States Information Ana... | https://www.adzuna.com/land/ad/5454326822?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN | data scientist graduate school of education, stanford, california, united states information ana... | Data Scientist @ Stanford University |
| 3 | 1367 | Data Scientist | Manulife | Boston, Suffolk County | At Manulife, data scientists collaborate with engineers, IT, and experts to build analytical mod... | https://www.adzuna.com/land/ad/5513716776?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN | at manulife, data scientists collaborate with engineers, it, and experts to build analytical mod... | Data Scientist @ Manulife |
| 4 | 1366 | Data Scientist | Generac Power Systems | Pewaukee, Waukesha County | We are Generac, a leading energy technology company committed to powering a smarter world. Over ... | https://www.adzuna.com/land/ad/5528566412?se=mvDVO3HX8BGRNdJTGBOSjQ&utm_medium=api&utm_source=1f... | NaN | we are generac, a leading energy technology company committed to powering a smarter world. over ... | Data Scientist @ Generac Power Systems |
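The description normalization above boils down to vectorized regex replacements. A tiny self-contained check on a made-up string:

```python
import pandas as pd

# Collapse whitespace runs (including newlines), then trim the ends,
# mirroring the cleanup applied to the description column.
s = pd.Series(["Data   Scientist\n\n Remote "])
cleaned = s.str.replace(r"\s+", " ", regex=True).str.strip()
# cleaned[0] == "Data Scientist Remote"
```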
**"From CSV Files"**: start here to skip the database connection step. Note that the next cell saves jobs_clean to artifacts/jobs_clean.csv; if a previous run already produced the artifact files, load them with pd.read_csv instead of querying the database.
jobs_clean.to_csv("artifacts/jobs_clean.csv", index=False)
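If you are skipping the database step, the saved artifacts can be read straight back into pandas. A minimal sketch, assuming a previous run produced the files under artifacts/ (the load_jobs_artifact helper is an addition for illustration):

```python
from pathlib import Path

import pandas as pd


def load_jobs_artifact(artifacts_dir: Path = Path("artifacts")) -> pd.DataFrame:
    """Restore the cleaned jobs table, preferring parquet over CSV."""
    parquet_path = artifacts_dir / "jobs_clean.parquet"
    if parquet_path.exists():
        # Parquet round-trips list-valued columns (e.g. skills) faithfully.
        return pd.read_parquet(parquet_path)
    # CSV fallback: note that list columns come back as plain strings.
    return pd.read_csv(artifacts_dir / "jobs_clean.csv")
```

Usage: jobs_clean = load_jobs_artifact(), then continue with the summary cell below.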
summary = (
jobs_clean.groupby("title")
.agg(num_roles=("id", "count"), avg_match_score=("match_score", "mean"))
.reset_index()
.sort_values("num_roles", ascending=False)
)
summary.head(20)
| title | num_roles | avg_match_score | |
|---|---|---|---|
| 53 | Data Engineer | 254 | 0.431432 |
| 254 | Sr. Azure Data Engineer | 100 | 0.449779 |
| 265 | Staff Data Engineer | 65 | 0.414906 |
| 23 | Business Analyst | 47 | NaN |
| 37 | Data Analyst | 31 | 0.519519 |
| 184 | Senior Data Engineer | 25 | 0.412558 |
| 61 | Data Engineer / Senior Data Engineer (GCP, BigQuery) | 25 | 0.452121 |
| 60 | Data Engineer / Senior Data Engineer (AI/ML) | 25 | 0.432949 |
| 170 | Principal Software Engineer | 17 | NaN |
| 181 | Senior Data Analyst | 17 | 0.506788 |
| 34 | Consultant, Data Engineering | 16 | 0.465818 |
| 168 | Principal Data Engineer | 16 | 0.414667 |
| 206 | Senior Software Engineer | 16 | NaN |
| 264 | Sr. Software Engineer | 15 | NaN |
| 13 | AWS Data Engineer | 15 | 0.471633 |
| 57 | Data Engineer - Manager | 14 | 0.422792 |
| 157 | Lead Software Engineer | 14 | NaN |
| 102 | GCP Data Engineer | 14 | 0.394899 |
| 252 | Sr Data Engineer | 14 | 0.461888 |
| 124 | IT Support Engineer | 12 | NaN |
Dictionary-Based Skill Extraction (Baseline)
We first apply the curated keyword dictionary to tag skills, preserving the original deterministic baseline. These results are saved separately so we can compare coverage against the NLP-driven extraction that follows.
import re
from typing import List, Tuple
# --- 1. CLEANED AND CORRECTED SKILL DICTIONARY ---
SKILLS_DICTIONARY = {
"Programming_Tools": [
"Python",
"R",
"SQL",
"Scala",
"Java",
"Julia",
"Bash",
"Jupyter",
"Git",
"GitHub",
"Pandas",
"NumPy",
"SciPy",
"Matplotlib",
"Seaborn",
"Plotly",
],
"MLOps_DevOps_Cloud": [
"MLOps",
"Docker",
"Kubernetes",
"Airflow",
"MLflow",
"Kubeflow",
"DVC",
"AWS",
"SageMaker",
"Azure ML",
"GCP",
"BigQuery",
"S3",
"Lambda",
"Databricks",
"Snowflake",
"Spark",
"PySpark",
"Hadoop",
"Kafka",
],
"Machine_Learning_and_Artificial_Intelligence": [
"Machine Learning",
"Deep Learning",
"Neural Network",
"NLP",
"Natural Language Processing",
"Computer Vision",
"CV",
"Reinforcement Learning",
"Supervised Learning",
"Unsupervised Learning",
"Semi-Supervised Learning",
"Transfer Learning",
"Time Series",
"Anomaly Detection",
"Clustering",
"Classification",
"Regression",
"Cross Validation",
"Dimensionality Reduction",
"PCA",
"t-SNE",
"UMAP",
"scikit-learn",
"sklearn",
"statsmodels",
"PyTorch",
"TensorFlow",
"Keras",
"XGBoost",
"LightGBM",
"CatBoost",
"Prophet",
"Sktime",
],
"Statistics": [
"Statistics",
"Statistical Modeling",
"Hypothesis Testing",
"Bayesian Statistics",
"Confidence Intervals",
"P-Values",
"ANOVA",
"Regression Analysis",
"Chi-Squared Test",
"Distribution Fitting",
"Data Exploration",
"Summary Statistics",
"Descriptive Statistics",
],
"Finance_Fintech_Banking": [
"Financial Markets",
"Asset Management",
"Quantitative Finance",
"Quant",
"Credit Risk",
"Market Risk",
"Operational Risk",
"Underwriting",
"Fraud Detection",
"AML",
"Anti-Money Laundering",
"KYC",
"Know Your Customer",
"Basel III",
"MiFID",
"SEC",
"Trading Algorithm",
"Algorithmic Trading",
"Fixed Income",
"Derivatives",
"Hedge Fund",
"Valuation",
"Credit Card",
],
"Healthcare_Pharma_Biotech": [
"Healthtech",
"Biomedical",
"Clinical Trial",
"Drug Discovery",
"EHR",
"Electronic Health Record",
"Epic",
"Cerner",
"FHIR",
"HIPAA",
"PHI",
"Protected Health Information",
"CMS",
"Claims Data",
"ICD-10",
"CPT Codes",
"Payer",
"Provider",
"HEDIS",
"Population Health",
"Biostatistics",
"Genomics",
"Radiology",
"Medical Imaging",
],
"Insurance_Risk_Underwriting": [
"Underwriting",
"Actuarial Science",
"Loss Ratio",
"Claims Analysis",
"Cat Modeling",
"Catastrophe Modeling",
"P&C",
"Property & Casualty",
"Life Insurance",
"Pricing Model",
"Reserving",
"Policy Administration",
"Regulatory Compliance",
"Solvency II",
"Capital Modeling",
"Telematics",
],
"General_Business_Analytical": [
"A/B Testing",
"Experimentation",
"ROI Analysis",
"LTV",
"Customer Lifetime Value",
"Churn Prediction",
"Demand Forecasting",
"Supply Chain",
"Logistics",
"Pricing Optimization",
"Causal Inference",
"Econometrics",
"KPIs",
],
"Soft_Skills_Misc": [
"Communication Skills",
"Teamwork",
"Problem Solving",
"Critical Thinking",
"Stakeholder Management",
"Stakeholder Engagement",
"Prioritization",
"Self Starter",
"Detail Oriented",
"Start up",
],
"Software_Development_Engineering": [
"API",
"REST",
"OOP",
"Data Structures",
"Algorithms",
"Microservices",
"CI/CD",
"Agile",
"Scrum",
"TDD",
"Unit Testing",
"Integration Testing",
"System Design",
"Cloud Computing",
"JavaScript",
"TypeScript",
"C++",
"C#",
"Ruby",
"PHP",
"Go",
"Swift",
"Objective-C",
"HTML",
"CSS",
"React",
"Angular",
"Vue.js",
"Node.js",
"Django",
"Flask",
],
"Productivity_Tools": [
"Excel",
"Tableau",
"Power BI",
"Looker",
"Google Data Studio",
"PowerPoint",
"Word",
"Visio",
"Notion",
"Confluence",
"Slack",
"Trello",
"Asana",
"Jira",
"Google Sheets",
"Google Docs",
"Microsoft Office",
],
}
SKILL_PATTERNS = {
    skill: {
        "category": category,
        # Lookarounds instead of \b: \b fails after a trailing non-word character,
        # so skills like "C++" and "C#" would never match at word edges.
        "pattern": re.compile(rf"(?<!\w){re.escape(skill)}(?!\w)", re.IGNORECASE),
    }
    for category, skills in SKILLS_DICTIONARY.items()
    for skill in skills
}
def extract_skills_dictionary(text: str) -> List[Tuple[str, str]]:
matches: List[Tuple[str, str]] = []
for skill, meta in SKILL_PATTERNS.items():
if meta["pattern"].search(text):
matches.append((skill.title(), meta["category"]))
return matches
jobs_clean["skills_dictionary"] = jobs_clean["description_lower"].apply(extract_skills_dictionary)
jobs_skill_map_dictionary = (
jobs_clean[["id", "job_label", "company", "skills_dictionary"]]
.explode("skills_dictionary")
.dropna(subset=["skills_dictionary"])
)
jobs_skill_map_dictionary["skill_name"] = jobs_skill_map_dictionary["skills_dictionary"].apply(lambda item: item[0])
jobs_skill_map_dictionary["skill_category"] = jobs_skill_map_dictionary["skills_dictionary"].apply(lambda item: item[1])
print(
f"Dictionary extractor tagged {jobs_skill_map_dictionary['skill_name'].nunique():,} unique skills across {jobs_skill_map_dictionary['id'].nunique():,} jobs."
)
jobs_clean["skills"] = jobs_clean["skills_dictionary"].apply(
lambda pairs: [skill for skill, _ in pairs] if isinstance(pairs, list) else []
)
jobs_skill_map = jobs_skill_map_dictionary.copy()
Dictionary extractor tagged 128 unique skills across 490 jobs.
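The dictionary extractor reduces to case-insensitive whole-word regex search. A toy two-skill illustration (the demo dictionary and sentence are made up for the example):

```python
import re

# Lookarounds give whole-word matching even for skills ending in
# non-word characters such as "C++".
demo_patterns = {
    "Python": re.compile(r"(?<!\w)Python(?!\w)", re.IGNORECASE),
    "C++": re.compile(r"(?<!\w)C\+\+(?!\w)", re.IGNORECASE),
}

sentence = "Strong python and c++ skills; pythonic style alone is not counted."
found = sorted(skill for skill, pat in demo_patterns.items() if pat.search(sentence))
# found == ["C++", "Python"]
```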
# Build the bipartite job-skill graph (jobs on one partition, skills on the other).
B = nx.Graph()
# Add job nodes so each posting is uniquely identified in the graph.
for _, row in jobs_clean.iterrows():
job_node = f"job_{row['id']}"
B.add_node(
job_node,
bipartite="job",
label=row["job_label"],
company=row["company"],
location=row["location"],
)
# Add skill nodes and connect them to the jobs that mention them.
for _, row in jobs_skill_map.iterrows():
skill_node = f"skill_{row['skill_name'].lower().replace(' ', '_')}"
B.add_node(
skill_node,
bipartite="skill",
label=row["skill_name"],
category=row["skill_category"],
)
job_node = f"job_{row['id']}"
B.add_edge(job_node, skill_node)
print(f"Graph has {B.number_of_nodes():,} nodes and {B.number_of_edges():,} edges.")
Graph has 1,336 nodes and 1,973 edges.
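A cheap sanity check for a graph built this way is that edges only ever cross partitions. A self-contained sketch on a toy graph shaped like B:

```python
import networkx as nx

# Toy version of the bipartite job-skill graph.
G = nx.Graph()
G.add_node("job_1", bipartite="job")
G.add_node("skill_python", bipartite="skill")
G.add_node("skill_sql", bipartite="skill")
G.add_edges_from([("job_1", "skill_python"), ("job_1", "skill_sql")])

# Every edge should connect a job to a skill, never job-job or skill-skill.
crossing = all(G.nodes[u]["bipartite"] != G.nodes[v]["bipartite"] for u, v in G.edges())
assert crossing
assert nx.is_bipartite(G)  # structural check, independent of the attributes
```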
MAX_JOBS_FOR_BIPARTITE_PLOT = jobs_clean.shape[0] # Adjust this number to limit the jobs shown in the plot.
sample_job_nodes = [
f"job_{job_id}" for job_id in jobs_clean.head(MAX_JOBS_FOR_BIPARTITE_PLOT)["id"]
]
connected_skill_nodes = set()
for job in sample_job_nodes:
connected_skill_nodes.update(B.neighbors(job))
subgraph_nodes = sample_job_nodes + list(connected_skill_nodes)
H_bipartite = B.subgraph(subgraph_nodes).copy()
if not H_bipartite:
raise ValueError(
"Bipartite subgraph is empty. Ensure jobs_skill_map is populated before plotting."
)
job_nodes_sub = [
n for n, d in H_bipartite.nodes(data=True) if d.get("bipartite") == "job"
]
skill_nodes_sub = [n for n in H_bipartite if n not in job_nodes_sub]
pos = {}
# Align jobs on the left (x=0) and skills on the right (x=1) for clarity.
pos.update((node, (0, idx)) for idx, node in enumerate(job_nodes_sub))
pos.update((node, (1, idx)) for idx, node in enumerate(skill_nodes_sub))
plt.figure(figsize=(12, 8))
nx.draw_networkx_nodes(
H_bipartite,
pos,
nodelist=job_nodes_sub,
node_color="#1f3ab4f0",
node_size=200,
label="Jobs",
)
nx.draw_networkx_nodes(
H_bipartite,
pos,
nodelist=skill_nodes_sub,
node_color="#ff520e",
node_size=200,
label="Skills",
)
nx.draw_networkx_edges(H_bipartite, pos, alpha=0.3)
plt.legend()
plt.title("Bipartite Job >> Skill Graph")
plt.axis("off")
plt.tight_layout()
plt.show()
Let's look at the most popular job titles in the dataset.
title_counts = jobs_clean["title"].value_counts()
title_counts.head(20)
px.bar(
title_counts.head(20),
title="Top 20 Titles by Number of Job Postings",
labels={"index": "Job Title", "value": "Number of Job Postings"},
)
Now that we have seen the kinds of job titles in the dataset, let's look at the top skills pulled from the job descriptions.
skill_counts = (
jobs_skill_map.groupby(["skill_name"])
.agg(num_jobs=("id", "count"))
.reset_index()
.sort_values("num_jobs", ascending=False)
)
px.bar(
skill_counts.head(30),
x="skill_name",
y="num_jobs",
#color="skill_category",
title="Top Skills Across Alfred Jobs",
)
from networkx.algorithms import bipartite
# Separate the partitions so we can project skills onto themselves.
skill_nodes = {n for n, d in B.nodes(data=True) if d.get("bipartite") == "skill"}
job_nodes = set(B) - skill_nodes
# Skills are connected when they co-occur in at least one job posting (weighted by frequency).
skill_projection = bipartite.weighted_projected_graph(B, skill_nodes)
print(
f"Skill projection has {skill_projection.number_of_nodes()} nodes / {skill_projection.number_of_edges()} edges"
)
# Degree centrality surfaces the skills that connect to the most neighbors in this co-occurrence graph.
centrality = nx.degree_centrality(skill_projection)
centrality_df = pd.DataFrame(
{
"skill_node": list(skill_projection.nodes()),
"skill_name": [
skill_projection.nodes[n]["label"] for n in skill_projection.nodes
],
"category": [
skill_projection.nodes[n]["category"] for n in skill_projection.nodes
],
"centrality": [centrality[n] for n in skill_projection.nodes],
}
).sort_values("centrality", ascending=False)
centrality_df.head(40)
Skill projection has 128 nodes / 1765 edges
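The weighting rule is easiest to see on a toy graph: two skills share an edge whose weight is the number of jobs in which they co-occur. A self-contained sketch:

```python
import networkx as nx
from networkx.algorithms import bipartite

G = nx.Graph()
G.add_nodes_from(["job_1", "job_2"], bipartite="job")
G.add_nodes_from(["python", "sql", "aws"], bipartite="skill")
G.add_edges_from(
    [
        ("job_1", "python"),
        ("job_1", "sql"),
        ("job_2", "python"),
        ("job_2", "sql"),
        ("job_2", "aws"),
    ]
)

proj = bipartite.weighted_projected_graph(G, {"python", "sql", "aws"})
# python-sql co-occur in both jobs (weight 2);
# python-aws and sql-aws co-occur only in job_2 (weight 1 each).
```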
| skill_node | skill_name | category | centrality | |
|---|---|---|---|---|
| 111 | skill_python | Python | Programming_Tools | 0.811024 |
| 1 | skill_sql | Sql | Programming_Tools | 0.803150 |
| 109 | skill_communication_skills | Communication Skills | Soft_Skills_Misc | 0.661417 |
| 116 | skill_aws | Aws | MLOps_DevOps_Cloud | 0.598425 |
| 35 | skill_machine_learning | Machine Learning | Machine_Learning_and_Artificial_Intelligence | 0.590551 |
| ... | ... | ... | ... | ... |
| 94 | skill_powerpoint | Powerpoint | Productivity_Tools | 0.299213 |
| 115 | skill_hipaa | Hipaa | Healthcare_Pharma_Biotech | 0.283465 |
| 95 | skill_scikit-learn | Scikit-Learn | Machine_Learning_and_Artificial_Intelligence | 0.267717 |
| 48 | skill_kpis | Kpis | General_Business_Analytical | 0.267717 |
| 86 | skill_angular | Angular | Software_Development_Engineering | 0.267717 |
40 rows × 4 columns
centrality_df.head(10)
| skill_node | skill_name | category | centrality | |
|---|---|---|---|---|
| 111 | skill_python | Python | Programming_Tools | 0.811024 |
| 1 | skill_sql | Sql | Programming_Tools | 0.803150 |
| 109 | skill_communication_skills | Communication Skills | Soft_Skills_Misc | 0.661417 |
| 116 | skill_aws | Aws | MLOps_DevOps_Cloud | 0.598425 |
| 35 | skill_machine_learning | Machine Learning | Machine_Learning_and_Artificial_Intelligence | 0.590551 |
| 66 | skill_java | Java | Programming_Tools | 0.543307 |
| 100 | skill_agile | Agile | Software_Development_Engineering | 0.527559 |
| 127 | skill_go | Go | Software_Development_Engineering | 0.511811 |
| 61 | skill_tableau | Tableau | Productivity_Tools | 0.503937 |
| 57 | skill_life_insurance | Life Insurance | Insurance_Risk_Underwriting | 0.503937 |
Let's visualize the 50 most central skills by degree centrality.
# Focus on the most central skills so the visualization stays interpretable.
top_nodes = centrality_df.head(50)["skill_node"]
H = skill_projection.subgraph(top_nodes).copy()
pos = nx.spring_layout(H, seed=42) # deterministic layout for reproducibility
nodes = list(H.nodes())
x = [pos[n][0] for n in nodes]
y = [pos[n][1] for n in nodes]
size = [10 + 80 * centrality[n] for n in nodes]
categories = pd.Categorical([H.nodes[n]["category"] for n in nodes])
fig = go.Figure()
# edges as line segments
for u, v, data in H.edges(data=True):
fig.add_trace(
go.Scatter(
x=[pos[u][0], pos[v][0]],
y=[pos[u][1], pos[v][1]],
mode="lines",
line=dict(width=max(1, data.get("weight", 1) * 0.2), color="#cccccc"),
hoverinfo="skip",
showlegend=False,
)
)
# nodes
fig.add_trace(
go.Scatter(
x=x,
y=y,
mode="markers+text",
text=[H.nodes[n]["label"] for n in nodes],
textposition="bottom center",
marker=dict(
size=size,
color=categories.codes,
colorscale="Turbo",
line=dict(width=1, color="#333"),
),
hovertext=[
f"{H.nodes[n]['label']}<br>Category: {H.nodes[n]['category']}<br>Centrality: {centrality[n]:.3f}"
for n in nodes
],
hoverinfo="text",
)
)
fig.update_layout(
title="Top Skill Co-occurrence Network",
xaxis=dict(visible=False),
yaxis=dict(visible=False),
showlegend=False,
margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
Persist Intermediate Tables (Optional)
- You don't need to run this cell if you are starting from the CSV files above.
Export clean job and job/skill mapping tables so they can be fed into Gephi or downstream dashboards without re-querying the live database.
# Persist tidy tables so collaborators can use them without re-running the database queries.
OUTPUT_DIR = PROJECT_ROOT / "artifacts"
OUTPUT_DIR.mkdir(exist_ok=True)
# Parquet for efficient reloads.
jobs_clean.to_parquet(OUTPUT_DIR / "jobs_clean.parquet", index=False)
jobs_skill_map.to_parquet(OUTPUT_DIR / "jobs_skill_map.parquet", index=False)
# CSV exports for easy sharing/export to other tools.
jobs_clean.to_csv(OUTPUT_DIR / "jobs_clean.csv", index=False)
jobs_skill_map.to_csv(OUTPUT_DIR / "jobs_skill_map.csv", index=False)
print(f"Saved artifacts to {OUTPUT_DIR}")
Saved artifacts to d:\Data620\FinalProject\artifacts
Now we can see whether any communities emerge among the top skills. The code below uses NetworkX's greedy modularity maximization (the Clauset-Newman-Moore algorithm) rather than the better-known Louvain method, but both work the same way in spirit: they optimize modularity to find clusters of nodes that are more densely connected internally than with the rest of the network. Starting with each node in its own community, the algorithm repeatedly merges the pair of communities that yields the largest modularity gain and stops when no merge improves modularity further.
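Modularity-based detection is easy to sanity-check on a graph with planted structure. A minimal sketch, using NetworkX's greedy modularity maximization on two cliques joined by a single bridge edge:

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

# barbell_graph(5, 0): two 5-cliques (nodes 0-4 and 5-9) joined by one edge.
G = nx.barbell_graph(5, 0)
communities = nx_comm.greedy_modularity_communities(G)
# The detector should recover each clique as its own community.
```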
Skill "Island" Communities
To surface densely connected pockets of skills, we threshold the skill-projection edges by weight and
run a modularity-based community detector. Tweak the WEIGHT_THRESHOLD or drop the filter entirely
if you want to study the full graph.
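The filter-then-subgraph pattern used below can be shown in miniature (toy weights chosen for the demo):

```python
import networkx as nx

G = nx.Graph()
G.add_edge("python", "sql", weight=5)
G.add_edge("sql", "visio", weight=1)

threshold = 3
kept = [(u, v) for u, v, d in G.edges(data=True) if d["weight"] >= threshold]
# edge_subgraph keeps node/edge attributes and drops nodes left without edges.
H = G.edge_subgraph(kept).copy()
# H retains only the python-sql edge; "visio" disappears entirely.
```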
weights = [data["weight"] for _, _, data in skill_projection.edges(data=True)]
min_weight = min(weights)
max_weight = max(weights)
print(min_weight, max_weight)
WEIGHT_THRESHOLD = 3 # minimum shared-job weight required to keep an edge
filtered_edges = [
(u, v)
for u, v, data in skill_projection.edges(data=True)
if data.get("weight", 0) >= WEIGHT_THRESHOLD
]
if not filtered_edges:
raise ValueError(
"No edges survived the weight threshold. Adjust WEIGHT_THRESHOLD or refine the NLP skill extraction heuristics."
)
# edge_subgraph preserves node attributes (label/category) from the original skill projection
island_graph = skill_projection.edge_subgraph(filtered_edges).copy()
communities = list(nx_comm.greedy_modularity_communities(island_graph, weight="weight"))
community_map = {
node: idx for idx, nodes in enumerate(communities, start=1) for node in nodes
}
community_summary = pd.DataFrame(
[
{
"community_id": idx,
"size": len(nodes),
"sample_skills": ", ".join(
sorted(island_graph.nodes[n]["label"] for n in list(nodes)[:5])
),
}
for idx, nodes in enumerate(communities, start=1)
]
).sort_values("size", ascending=False)
print(f"Detected {len(communities)} dense skill communities after thresholding.")
community_summary.head(20)
1 90
Detected 5 dense skill communities after thresholding.
| community_id | size | sample_skills | |
|---|---|---|---|
| 0 | 1 | 26 | Communication Skills, Flask, Pandas, Sql, Teamwork |
| 1 | 2 | 21 | Classification, Deep Learning, Fraud Detection, Javascript, React |
| 2 | 3 | 18 | Go, Jira, Life Insurance, Microservices, Typescript |
| 3 | 4 | 17 | Airflow, Github, Hadoop, Kubernetes, Scala |
| 4 | 5 | 2 | Credit Card, Statistical Modeling |
# Visualize islands with Plotly for interactive inspection
H = island_graph # already contains filtered nodes with attributes
pos = nx.spring_layout(H, seed=42, k=0.8)
nodes = list(H.nodes())
node_x = [pos[n][0] for n in nodes]
node_y = [pos[n][1] for n in nodes]
node_sizes = [12 + 30 * centrality.get(n, 0) for n in nodes]
communities_cat = pd.Categorical([community_map.get(n, 0) for n in nodes])
fig = go.Figure()
for u, v, data in H.edges(data=True):
fig.add_trace(
go.Scatter(
x=[pos[u][0], pos[v][0]],
y=[pos[u][1], pos[v][1]],
mode="lines",
line=dict(width=max(1, data.get("weight", 1) * 0.15), color="#bbbbbb"),
hoverinfo="skip",
showlegend=False,
)
)
fig.add_trace(
go.Scatter(
x=node_x,
y=node_y,
mode="markers+text",
text=[H.nodes[n]["label"] for n in nodes],
textposition="bottom center",
marker=dict(
size=node_sizes,
color=communities_cat.codes,
colorscale="Plasma",
line=dict(width=1, color="#333"),
),
hovertext=[
f"{H.nodes[n]['label']}<br>Community: {community_map.get(n)}<br>Degree: {H.degree(n)}"
for n in nodes
],
hoverinfo="text",
showlegend=False,
)
)
fig.update_layout(
title="Skill Islands (communities on weighted projection)",
xaxis=dict(visible=False),
yaxis=dict(visible=False),
margin=dict(l=20, r=20, t=60, b=20),
)
fig.write_html("skill_islands_network.html", auto_open=True)
fig.write_json("skill_islands_visualization.json")
Dynamic NLP Skill Extraction
After completing the dictionary-driven network analysis above, we can optionally
generate a dynamic skill inventory with spaCy to compare coverage. This cell works on a copy of
jobs_clean so the baseline results stay intact.
We use spaCy to identify candidate skill phrases in job descriptions: noun chunks and part-of-speech-filtered token n-grams are normalized and then screened with heuristics (anchor terms, a single-token allowlist, and stopword filters). We then rebuild the job-skill bipartite graph using the NLP-extracted skills and repeat the centrality and community detection analyses to compare against the baseline. We add some stopword skills to filter out common but uninformative terms.
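Before the full pipeline, the normalization step at its core can be previewed in isolation. The helper below repeats the normalize_phrase logic defined later in this section, so the sketch is self-contained:

```python
import re


def normalize_phrase(text: str) -> str:
    """Lowercase, strip non-token characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9+#/&\- ]+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

# normalize_phrase("  Machine Learning / MLOps!!  ") == "machine learning / mlops"
# normalize_phrase("C++ & SQL") == "c++ & sql"
```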
import itertools
from collections import Counter
import spacy
from spacy.cli import download as spacy_download
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# make a copy of jobs_clean to avoid modifying the baseline skill extraction results
jobs_clean_nlp = jobs_clean.copy().reset_index(drop=True)
try:
nlp = spacy.load("en_core_web_sm")
except OSError:
spacy_download("en_core_web_sm")
nlp = spacy.load("en_core_web_sm")
# Adjust spaCy's max_length if any job description exceeds the default limit.
max_desc_length = int(jobs_clean_nlp["description"].fillna("").str.len().max())
if max_desc_length:
nlp.max_length = max(nlp.max_length, int(max_desc_length * 1.2))
# Define custom stopwords and filtering heuristics.
GENERIC_TERMS = {
"ability",
"abilities",
"team",
"teams",
"company",
"companies",
"business",
"clients",
"client",
"stakeholders",
"stakeholder",
"role",
"roles",
"position",
"positions",
"responsibilities",
"responsibility",
"experience",
"experiences",
"people",
"solutions",
"services",
"product",
"products",
"organization",
"organizations",
"environment",
"environments",
"technology",
"technologies",
"year",
"years",
"degree",
"degrees",
"applicant",
"applicants",
"job",
"jobs",
"candidate",
"candidates",
"orientation",
"data engineer",
"data scientist",
}
# Words that are unlikely to be part of a skill phrase when they appear at the end.
BAD_ENDINGS = {"team", "teams", "company", "business", "clients", "stakeholders", "organization", "environment", "technology", "year", "job"}
# some custom stopwords relevant to job descriptions
CUSTOM_STOPWORDS = {
"technology",
"technologies",
"year",
"years",
"degree",
"degrees",
"applicant",
"applicants",
"job",
"jobs",
"candidate",
"candidates",
"orientation",
"team",
"teams",
"company",
"business",
"people",
"title",
}
STOPWORD_SET = {word.lower() for word in STOP_WORDS}.union(CUSTOM_STOPWORDS)
DICTIONARY_SINGLE_SKILLS = {
skill.lower()
for skills in SKILLS_DICTIONARY.values()
for skill in skills
}
SINGLE_SKILL_ALLOWLIST = DICTIONARY_SINGLE_SKILLS.union(
{
"data",
"analytics",
"analysis",
"integration",
"engineering",
"modeling",
"machine",
"learning",
"ai",
"ml",
"python",
"sql",
"spark",
"aws",
"azure",
"gcp",
}
)
ANCHOR_TERMS = {
"data",
"analytics",
"analysis",
"integration",
"science",
"engineering",
"machine",
"learning",
"intelligence",
"modeling",
"ai",
"ml",
"cloud",
"platform",
"product",
"pipeline",
"governance",
"visualization",
"automation",
"insight",
"insights",
}
# Normalization function to clean and standardize phrases.
def normalize_phrase(text: str) -> str:
text = text.lower()
text = re.sub(r"[^a-z0-9+#/&\- ]+", " ", text)
text = re.sub(r"\s+", " ", text).strip()
return text
# Token filtering functions.
def token_is_candidate(token) -> bool:
if token.is_stop or token.is_punct:
return False
lemma = token.lemma_.lower()
if lemma in STOPWORD_SET:
return False
return token.pos_ in {"PROPN", "NOUN", "ADJ"}
# Heuristic to allow certain single-token skills.
def allow_single_token(token, norm: str) -> bool:
lemma = token.lemma_.lower()
text = token.text
if lemma in SINGLE_SKILL_ALLOWLIST:
return True
if text.isupper() and len(text) <= 4:
return True
if any(char.isdigit() for char in text):
return True
return False
# Heuristic to filter out non-skill-like phrases.
def is_skill_like(text: str) -> bool:
if not text or len(text) < 3:
return False
words = text.split()
if len(words) > 4:
return False
if all(word in GENERIC_TERMS for word in words):
return False
if words[-1] in BAD_ENDINGS:
return False
if text in STOPWORD_SET:
return False
return True
# Main extraction function for a spaCy doc.
def extract_skill_candidates(doc):
phrases = set()
candidate_tokens = [token for token in doc if token_is_candidate(token)]
for chunk in doc.noun_chunks:
norm = normalize_phrase(chunk.text)
if not is_skill_like(norm):
continue
words = norm.split()
if len(words) == 1:
token = chunk.root
if allow_single_token(token, norm):
phrases.add(norm)
else:
if any(word in ANCHOR_TERMS for word in words):
phrases.add(norm)
# Single-token candidates
for token in candidate_tokens:
norm = normalize_phrase(token.text)
if is_skill_like(norm) and allow_single_token(token, norm):
phrases.add(norm)
# Multi-token candidates (2-3 grams)
max_window = 3
for window in range(2, max_window + 1):
for i in range(len(candidate_tokens) - window + 1):
phrase_tokens = candidate_tokens[i : i + window]
lemma_window = [token.lemma_.lower() for token in phrase_tokens]
if not any(lemma in ANCHOR_TERMS for lemma in lemma_window):
continue
phrase = " ".join(token.text for token in phrase_tokens)
norm = normalize_phrase(phrase)
if is_skill_like(norm):
phrases.add(norm)
return sorted(phrases)
# Extract skill candidates from each job description.
descriptions = jobs_clean_nlp["description"].fillna("")
skill_candidates = []
for doc in tqdm(
nlp.pipe(descriptions.tolist(), batch_size=32),
total=len(descriptions),
desc="Extracting skill candidates",
):
skill_candidates.append(extract_skill_candidates(doc))
jobs_clean_nlp["skill_candidates_raw"] = skill_candidates
candidate_counts = Counter(itertools.chain.from_iterable(skill_candidates))
if not candidate_counts:
raise ValueError("spaCy-based skill extraction did not produce any candidates.")
# Apply global frequency filtering to build the skill vocabulary.
MIN_GLOBAL_FREQ = 5
MAX_GLOBAL_SHARE = 0.65
max_allowed = max(1, int(MAX_GLOBAL_SHARE * len(jobs_clean_nlp)))
skill_vocabulary = {
phrase
for phrase, count in candidate_counts.items()
if count >= MIN_GLOBAL_FREQ and count <= max_allowed
and phrase not in STOPWORD_SET
}
# Finalize skill list for each job by filtering to the vocabulary and deduplicating.
def finalize_skills(candidate_list):
filtered = [phrase for phrase in candidate_list if phrase in skill_vocabulary]
return sorted(set(filtered))
# Apply finalization to each job's candidates.
jobs_clean_nlp["skills_nlp"] = [
[phrase.title() for phrase in finalize_skills(candidates)]
for candidates in skill_candidates
]
# Build the job-skill mapping DataFrame from NLP-extracted skills.
jobs_skill_map_dynamic = (
jobs_clean_nlp[["id", "job_label", "company", "skills_nlp"]]
.explode("skills_nlp")
.dropna(subset=["skills_nlp"])
.rename(columns={"skills_nlp": "skill_name"})
)
# Ensure we have some skills after filtering.
if jobs_skill_map_dynamic.empty:
raise ValueError("No NLP-derived skills survived the filtering heuristics.")
# Prepare for clustering by normalizing skill names.
jobs_skill_map_dynamic["skill_name"] = jobs_skill_map_dynamic["skill_name"].str.strip()
# Cluster similar skills using TF-IDF + MiniBatchKMeans.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans
unique_skills = jobs_skill_map_dynamic["skill_name"].str.lower().unique().tolist()
skill_category_lookup = {}
if unique_skills:
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
skill_matrix = vectorizer.fit_transform(unique_skills)
n_clusters = min(8, len(unique_skills))
if n_clusters == 1:
cluster_labels = [0]
else:
clusterer = MiniBatchKMeans(
n_clusters=n_clusters,
random_state=42,
batch_size=min(256, len(unique_skills)),
max_iter=200,
)
cluster_labels = clusterer.fit_predict(skill_matrix)
skill_category_lookup = {
skill: f"NLP Cluster {label + 1}"
for skill, label in zip(unique_skills, cluster_labels)
}
# Map clustered categories back to the job-skill mapping.
jobs_skill_map_dynamic["skill_category"] = jobs_skill_map_dynamic["skill_name"].str.lower().map(
skill_category_lookup
)
jobs_skill_map_dynamic["skill_category"] = jobs_skill_map_dynamic["skill_category"].fillna("NLP Cluster 1")
jobs_skill_map_dynamic["skill_strength"] = jobs_skill_map_dynamic["skill_name"].str.lower().map(candidate_counts)
# Summary of NLP-extracted skills.
print(
f"NLP extractor tagged {jobs_skill_map_dynamic['skill_name'].nunique():,} unique skills across {jobs_skill_map_dynamic['id'].nunique():,} jobs."
)
NLP extractor tagged 965 unique skills across 1,026 jobs.
skill_extraction_summary = pd.DataFrame(
[
{
"method": "Dictionary",
"unique_skills": jobs_skill_map_dictionary["skill_name"].nunique(),
"jobs_with_skills": jobs_skill_map_dictionary["id"].nunique(),
},
{
"method": "NLP",
"unique_skills": jobs_skill_map_dynamic["skill_name"].nunique(),
"jobs_with_skills": jobs_skill_map_dynamic["id"].nunique(),
},
]
)
skill_extraction_summary
| | method | unique_skills | jobs_with_skills |
|---|---|---|---|
| 0 | Dictionary | 128 | 490 |
| 1 | NLP | 965 | 1026 |
skill_frequency = (
jobs_skill_map_dynamic["skill_name"].value_counts().rename_axis("skill_name").reset_index(name="mention_count")
)
skill_frequency
| | skill_name | mention_count |
|---|---|---|
| 0 | Data | 627 |
| 1 | Engineering | 389 |
| 2 | Data Engineer | 359 |
| 3 | Analysis | 259 |
| 4 | Data Engineering | 202 |
| ... | ... | ... |
| 960 | Ai Machine | 5 |
| 961 | Emr | 5 |
| 962 | Confluence | 5 |
| 963 | System Integration | 5 |
| 964 | Product Backlog | 5 |
965 rows × 2 columns
from spacy.lang.en.stop_words import STOP_WORDS
custom_stopwords = {
"technology",
"technologies",
"year",
"years",
"degree",
"degrees",
"applicant",
"applicants",
"job",
"jobs",
"candidate",
"candidates",
"orientation",
"team",
"teams",
"company",
"business",
"people",
"Excellent",
"Which",
"Must",
"Will",
"Work",
"Using",
"knowledge",
"skills",
"high",
"engineer",
"Science",
"an employee",
"aware",
"Service",
"Services",
"Based",
"Level",
"Levels",
"Various",
"including",
"related",
"field",
"fields",
"areas",
"area",
"Entreprise",
"enterprises",
"The role",
"ability",
"abilities",
"stakeholder",
"stakeholders",
"data",
"organization",
"organizations",
"environment",
"environments",
"range",
"the range",
"life",
"lives",
"the life",
"datum",
"platform",
"engineering",
"new",
"design",
"analysts",
"analyst",
"functional",
"industry",
"problem",
"tool",
"decision",
"application",
"dtrong",
"day",
"cross",
"news",
"development",
"this role",
"solution",
"time",
"world",
"system",
"systems",
"opportunity",
"opportunities",
"technical",
"skill",
"software",
"quality",
"customers",
"customer",
"location",
"locations",
"process",
"impact",
"impacts",
"member",
"description",
"project",
"strong",
"benefit",
"analysis",
"support",
"enterprise",
"enterprises",
"management",
"employee","information","informations","end","lead","employment",
"work","works","initiative","requirement","qualification","qualifications",
"complex","complexity","collaborate","collaborates","collaboration", "disability","diversity", "innovation","innovative","infrastructure","expertise",
"performance","performances",
"methodology","methodologies","method","methods","status","euality","equal","employer","employers","global","analytics","analytic",
"individual","Office","401 K","U S","USD"
}
custom_stopwords = {word.lower() for word in custom_stopwords}
stopword_flags = skill_frequency.assign(
skill_lower=lambda df: df["skill_name"].str.lower(),
spacy_stop=lambda df: df["skill_lower"].isin(STOP_WORDS),
custom_stop=lambda df: df["skill_lower"].isin(custom_stopwords),
)
stopword_flags[["skill_name", "mention_count", "spacy_stop", "custom_stop"]]
stopword_flags_filtered = stopword_flags[
~(stopword_flags["spacy_stop"] | stopword_flags["custom_stop"])
].reset_index(drop=True)
stopword_flags_filtered.head(20)
| | skill_name | mention_count | skill_lower | spacy_stop | custom_stop |
|---|---|---|---|---|---|
| 0 | Data Engineer | 359 | data engineer | False | False |
| 1 | Data Engineering | 202 | data engineering | False | False |
| 2 | Azure | 159 | azure | False | False |
| 3 | Integration | 148 | integration | False | False |
| 4 | Python | 135 | python | False | False |
| 5 | Sql | 135 | sql | False | False |
| 6 | Computer Science | 123 | computer science | False | False |
| 7 | Azure Data | 115 | azure data | False | False |
| 8 | Data Pipelines | 115 | data pipelines | False | False |
| 9 | Data Engineer Impact | 114 | data engineer impact | False | False |
| 10 | Data Platforms | 114 | data platforms | False | False |
| 11 | Azure Data Engineer | 112 | azure data engineer | False | False |
| 12 | Data Integration | 111 | data integration | False | False |
| 13 | Product Owners | 107 | product owners | False | False |
| 14 | Engineering Analytics | 105 | engineering analytics | False | False |
| 15 | Data Warehousing | 104 | data warehousing | False | False |
| 16 | Data Engineering Analytics | 103 | data engineering analytics | False | False |
| 17 | Member Data | 102 | member data | False | False |
| 18 | Analysis Development | 101 | analysis development | False | False |
| 19 | Data Warehousing Cloud | 101 | data warehousing cloud | False | False |
# Filter NLP-derived skills to drop stop words before ranking
stopword_set = {word.lower() for word in STOP_WORDS}.union(custom_stopwords)
jobs_skill_map_dynamic = jobs_skill_map_dynamic[
~jobs_skill_map_dynamic["skill_name"].str.lower().isin(stopword_set)
].copy()
skill_frequency = (
jobs_skill_map_dynamic["skill_name"].value_counts()
.rename_axis("skill_name")
.reset_index(name="mention_count")
)
skill_frequency.head(50)
| | skill_name | mention_count |
|---|---|---|
| 0 | Data Engineer | 359 |
| 1 | Data Engineering | 202 |
| 2 | Azure | 159 |
| 3 | Integration | 148 |
| 4 | Sql | 135 |
| ... | ... | ... |
| 45 | Learning | 70 |
| 46 | Machine | 70 |
| 47 | Artificial Intelligence | 65 |
| 48 | Etl | 58 |
| 49 | Java | 58 |
50 rows × 2 columns
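The case-insensitive stopword filter used above, demonstrated on a small hypothetical mapping:

```python
import pandas as pd

# Hypothetical job-skill rows; "Team" should be dropped as a stopword.
df = pd.DataFrame({"id": [1, 1, 2], "skill_name": ["Python", "Team", "Sql"]})
stopword_set = {"team", "company"}

# Lowercase before membership testing so capitalization never leaks through.
filtered = df[~df["skill_name"].str.lower().isin(stopword_set)].copy()
print(filtered["skill_name"].tolist())  # ['Python', 'Sql']
```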
skill_freq_ranked = (
skill_frequency.sort_values("mention_count", ascending=False)
.reset_index(drop=True)
.reset_index()
.rename(columns={"index": "skill_rank"})
)
fig = px.bar(
skill_freq_ranked,
x="skill_rank",
y="mention_count",
title="Skill Mention Frequency (NLP Extraction)",
labels={"skill_rank": "Skill Rank", "mention_count": "Mentions"},
)
fig.update_traces(
hovertemplate="Rank %{x}<br>Skill: %{customdata[0]}<br>Mentions: %{y}",
customdata=skill_freq_ranked[["skill_name"]].values,
)
fig.show()
NLP Skill Graphs
Reconstruct the job-skill network using the NLP-derived skills and compare it against the dictionary-based graph.
# Build the NLP-based bipartite graph and projected skill network
try:
jobs_skill_map_dynamic
jobs_clean_nlp
except NameError as exc: # pragma: no cover
raise RuntimeError("Run the NLP extraction cells before this section.") from exc
# Build the bipartite job-skill graph (jobs on one partition, skills on the other).
B_nlp = nx.Graph()
for _, row in jobs_clean_nlp.iterrows():
job_node = f"job_{row['id']}"
B_nlp.add_node(
job_node,
bipartite="job",
label=row.get("job_label", row.get("title", "Job")),
company=row.get("company", "Unknown"),
location=row.get("location", "Unknown"),
)
# Add skill nodes and connect them to the jobs that mention them.
for _, row in jobs_skill_map_dynamic.iterrows():
job_node = f"job_{row['id']}"
skill_node = f"skill_{row['skill_name'].lower().replace(' ', '_')}"
B_nlp.add_node(
skill_node,
bipartite="skill",
label=row["skill_name"],
category=row.get("skill_category", "NLP"),
)
if B_nlp.has_node(job_node):
B_nlp.add_edge(job_node, skill_node)
# Ensure the graph is not empty.
if B_nlp.number_of_nodes() == 0:
raise ValueError("NLP bipartite graph is empty. Ensure jobs_skill_map_dynamic is populated.")
# Summarize the NLP bipartite graph.
skill_nodes_nlp = {n for n, d in B_nlp.nodes(data=True) if d.get("bipartite") == "skill"}
job_nodes_nlp = set(B_nlp) - skill_nodes_nlp
print(f"B_nlp contains {len(job_nodes_nlp):,} job nodes, {len(skill_nodes_nlp):,} skill nodes, and {B_nlp.number_of_edges():,} edges.")
# Project skills onto themselves based on co-occurrence in job postings.
from networkx.algorithms import bipartite
skill_projection_nlp = bipartite.weighted_projected_graph(B_nlp, skill_nodes_nlp)
print(
f"NLP skill projection has {skill_projection_nlp.number_of_nodes():,} nodes / {skill_projection_nlp.number_of_edges():,} edges"
)
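On a toy bipartite graph the projection's edge weight counts how many jobs mention both skills:

```python
import networkx as nx
from networkx.algorithms import bipartite

# Toy graph: three jobs, two skills.
B = nx.Graph()
B.add_edges_from([
    ("job_1", "skill_python"), ("job_1", "skill_sql"),
    ("job_2", "skill_python"), ("job_2", "skill_sql"),
    ("job_3", "skill_python"),
])

# Project onto the skill partition; weight = number of shared job neighbors.
proj = bipartite.weighted_projected_graph(B, {"skill_python", "skill_sql"})
print(proj["skill_python"]["skill_sql"]["weight"])  # 2 -- co-mentioned in two jobs
```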
# Compute degree centrality on the NLP skill projection.
centrality_nlp = nx.degree_centrality(skill_projection_nlp)
centrality_df_nlp = pd.DataFrame(
{
"skill_node": list(skill_projection_nlp.nodes()),
"skill_name": [skill_projection_nlp.nodes[n]["label"] for n in skill_projection_nlp.nodes],
"category": [skill_projection_nlp.nodes[n].get("category", "NLP") for n in skill_projection_nlp.nodes],
"centrality": [centrality_nlp[n] for n in skill_projection_nlp.nodes],
}
).sort_values("centrality", ascending=False)
centrality_df_nlp.head(20)
B_nlp contains 1,208 job nodes, 958 skill nodes, and 17,652 edges. NLP skill projection has 958 nodes / 47,328 edges
| | skill_node | skill_name | category | centrality |
|---|---|---|---|---|
| 460 | skill_python | Python | NLP Cluster 4 | 0.613375 |
| 735 | skill_sql | Sql | NLP Cluster 4 | 0.589342 |
| 429 | skill_data_engineer | Data Engineer | NLP Cluster 4 | 0.584117 |
| 457 | skill_computer_science | Computer Science | NLP Cluster 4 | 0.569488 |
| 647 | skill_learning | Learning | NLP Cluster 4 | 0.538140 |
| 472 | skill_aws | Aws | NLP Cluster 4 | 0.518286 |
| 58 | skill_data_pipelines | Data Pipelines | NLP Cluster 8 | 0.504702 |
| 449 | skill_data_engineering | Data Engineering | NLP Cluster 4 | 0.495298 |
| 60 | skill_azure | Azure | NLP Cluster 4 | 0.462905 |
| 244 | skill_etl | Etl | NLP Cluster 4 | 0.455590 |
| 951 | skill_modeling | Modeling | NLP Cluster 4 | 0.432602 |
| 764 | skill_integration | Integration | NLP Cluster 4 | 0.425287 |
| 777 | skill_machine | Machine | NLP Cluster 4 | 0.422153 |
| 601 | skill_bachelor_computer_science | Bachelor Computer Science | NLP Cluster 4 | 0.409613 |
| 849 | skill_experience_data | Experience Data | NLP Cluster 3 | 0.385580 |
| 272 | skill_engineering_related | Engineering Related | NLP Cluster 4 | 0.371996 |
| 827 | skill_java | Java | NLP Cluster 4 | 0.366771 |
| 295 | skill_spark | Spark | NLP Cluster 4 | 0.365726 |
| 311 | skill_machine_learning | Machine Learning | NLP Cluster 4 | 0.364681 |
| 668 | skill_data_analysis | Data Analysis | NLP Cluster 4 | 0.364681 |
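For intuition, degree centrality is simply a node's degree divided by n-1; on a four-node star the hub scores exactly 1.0:

```python
import networkx as nx

G = nx.star_graph(3)  # node 0 connected to nodes 1, 2, 3
centrality = nx.degree_centrality(G)
print(centrality[0])  # 1.0 -- the hub touches every other node
print(centrality[1])  # 1/3 -- leaves touch only the hub
```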
def summarize_projection(graph, skill_nodes_count, job_nodes_count, projection):
degree_values = [deg for _, deg in projection.degree()]
avg_degree = (sum(degree_values) / len(degree_values)) if degree_values else 0
return {
"jobs": job_nodes_count,
"skills": skill_nodes_count,
"job_skill_edges": graph.number_of_edges(),
"projection_nodes": projection.number_of_nodes(),
"projection_edges": projection.number_of_edges(),
"projection_avg_degree": avg_degree,
}
summary_df = pd.DataFrame(
[
{"pipeline": "Dictionary", **summarize_projection(B, len(skill_nodes), len(job_nodes), skill_projection)},
{"pipeline": "NLP", **summarize_projection(B_nlp, len(skill_nodes_nlp), len(job_nodes_nlp), skill_projection_nlp)},
]
)
summary_df
| | pipeline | jobs | skills | job_skill_edges | projection_nodes | projection_edges | projection_avg_degree |
|---|---|---|---|---|---|---|---|
| 0 | Dictionary | 1208 | 128 | 1973 | 128 | 1765 | 27.578125 |
| 1 | NLP | 1208 | 958 | 17652 | 958 | 47328 | 98.805846 |
top_k = 20
top_centrality = pd.concat(
[
centrality_df.head(top_k).assign(pipeline="Dictionary"),
centrality_df_nlp.head(top_k).assign(pipeline="NLP"),
],
ignore_index=True,
)
top_centrality
| | skill_node | skill_name | category | centrality | pipeline |
|---|---|---|---|---|---|
| 0 | skill_python | Python | Programming_Tools | 0.811024 | Dictionary |
| 1 | skill_sql | Sql | Programming_Tools | 0.803150 | Dictionary |
| 2 | skill_communication_skills | Communication Skills | Soft_Skills_Misc | 0.661417 | Dictionary |
| 3 | skill_aws | Aws | MLOps_DevOps_Cloud | 0.598425 | Dictionary |
| 4 | skill_machine_learning | Machine Learning | Machine_Learning_and_Artificial_Intelligence | 0.590551 | Dictionary |
| ... | ... | ... | ... | ... | ... |
| 35 | skill_engineering_related | Engineering Related | NLP Cluster 4 | 0.371996 | NLP |
| 36 | skill_java | Java | NLP Cluster 4 | 0.366771 | NLP |
| 37 | skill_spark | Spark | NLP Cluster 4 | 0.365726 | NLP |
| 38 | skill_machine_learning | Machine Learning | NLP Cluster 4 | 0.364681 | NLP |
| 39 | skill_data_analysis | Data Analysis | NLP Cluster 4 | 0.364681 | NLP |
40 rows × 5 columns
# Cap on jobs shown in the bipartite plot; currently set to include every job.
MAX_JOBS_FOR_NLP_BIPARTITE_PLOT = jobs_clean_nlp.shape[0]
sample_job_nodes_nlp = [
f"job_{job_id}" for job_id in jobs_clean_nlp.head(MAX_JOBS_FOR_NLP_BIPARTITE_PLOT)["id"]
]
connected_skill_nodes_nlp = set()
for job in sample_job_nodes_nlp:
if B_nlp.has_node(job):
connected_skill_nodes_nlp.update(B_nlp.neighbors(job))
subgraph_nodes_nlp = sample_job_nodes_nlp + list(connected_skill_nodes_nlp)
H_bipartite_nlp = B_nlp.subgraph(subgraph_nodes_nlp).copy()
if not H_bipartite_nlp:
raise ValueError("NLP bipartite subgraph is empty. Ensure jobs_skill_map_dynamic is populated before plotting.")
job_nodes_sub_nlp = [n for n, d in H_bipartite_nlp.nodes(data=True) if d.get("bipartite") == "job"]
skill_nodes_sub_nlp = [n for n in H_bipartite_nlp if n not in job_nodes_sub_nlp]
pos_nlp = {}
pos_nlp.update((node, (0, idx)) for idx, node in enumerate(job_nodes_sub_nlp))
pos_nlp.update((node, (1, idx)) for idx, node in enumerate(skill_nodes_sub_nlp))
plt.figure(figsize=(12, 8))
nx.draw_networkx_nodes(
H_bipartite_nlp,
pos_nlp,
nodelist=job_nodes_sub_nlp,
node_color="#1f3ab4f0",
node_size=200,
label="Jobs",
)
nx.draw_networkx_nodes(
H_bipartite_nlp,
pos_nlp,
nodelist=skill_nodes_sub_nlp,
node_color="#ff520e",
node_size=200,
label="Skills",
)
nx.draw_networkx_edges(H_bipartite_nlp, pos_nlp, alpha=0.3)
plt.legend()
plt.title("NLP Bipartite Job >> Skill Graph")
plt.axis("off")
plt.tight_layout()
plt.show()
top_nodes_nlp = centrality_df_nlp.head(100)["skill_node"]
H_nlp_proj = skill_projection_nlp.subgraph(top_nodes_nlp).copy()
pos_proj_nlp = nx.spring_layout(H_nlp_proj, seed=42)
nodes = list(H_nlp_proj.nodes())
x = [pos_proj_nlp[n][0] for n in nodes]
y = [pos_proj_nlp[n][1] for n in nodes]
size = [10 + 80 * centrality_nlp.get(n, 0) for n in nodes]
fig = go.Figure()
for u, v, data in H_nlp_proj.edges(data=True):
fig.add_trace(
go.Scatter(
x=[pos_proj_nlp[u][0], pos_proj_nlp[v][0]],
y=[pos_proj_nlp[u][1], pos_proj_nlp[v][1]],
mode="lines",
line=dict(width=max(1, data.get("weight", 1) * 0.2), color="#cccccc"),
hoverinfo="skip",
showlegend=False,
)
)
fig.add_trace(
go.Scatter(
x=x,
y=y,
mode="markers+text",
text=[H_nlp_proj.nodes[n]["label"] for n in nodes],
textposition="bottom center",
marker=dict(size=size, color="#9467bd", line=dict(width=1, color="#333")),
hovertext=[
f"{H_nlp_proj.nodes[n]['label']}<br>Degree: {H_nlp_proj.degree(n)}"
for n in nodes
],
hoverinfo="text",
)
)
fig.update_layout(
title="NLP Skill Projection (Top 100 nodes)",
xaxis=dict(visible=False),
yaxis=dict(visible=False),
margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
NLP Skill Islands
Apply the same weight-thresholded community detection on the NLP skill projection.
weights_nlp = [data["weight"] for _, _, data in skill_projection_nlp.edges(data=True)]
if not weights_nlp:
raise ValueError("NLP skill projection has no edges to analyze.")
print(min(weights_nlp), max(weights_nlp))
WEIGHT_THRESHOLD_NLP = 20
filtered_edges_nlp = [
(u, v)
for u, v, data in skill_projection_nlp.edges(data=True)
if data.get("weight", 0) >= WEIGHT_THRESHOLD_NLP
]
if not filtered_edges_nlp:
raise ValueError("No NLP edges survived the weight threshold. Adjust WEIGHT_THRESHOLD_NLP or revisit the skill extraction.")
island_graph_nlp = skill_projection_nlp.edge_subgraph(filtered_edges_nlp).copy()
communities_nlp = list(nx_comm.greedy_modularity_communities(island_graph_nlp, weight="weight"))
community_map_nlp = {
node: idx for idx, nodes in enumerate(communities_nlp, start=1) for node in nodes
}
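The same detector on NetworkX's built-in karate-club benchmark; `greedy_modularity_communities` (the Clauset-Newman-Moore algorithm, not Louvain) returns frozensets that partition the node set:

```python
import networkx as nx
import networkx.algorithms.community as nx_comm

G = nx.karate_club_graph()
communities = list(nx_comm.greedy_modularity_communities(G))

# The communities are disjoint and together cover every node.
all_nodes = set().union(*communities)
print(len(communities), len(all_nodes))  # typically 3 communities over 34 nodes
```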
community_summary_nlp = pd.DataFrame(
[
{
"community_id": idx,
"size": len(nodes),
"sample_skills": ", ".join(
sorted(island_graph_nlp.nodes[n]["label"] for n in list(nodes)[:5])
),
}
for idx, nodes in enumerate(communities_nlp, start=1)
]
).sort_values("size", ascending=False)
print(f"Detected {len(communities_nlp)} NLP skill communities after thresholding.")
community_summary_nlp.head(20)
1 166
Detected 10 NLP skill communities after thresholding.
| | community_id | size | sample_skills |
|---|---|---|---|
| 0 | 1 | 62 | Chief Artificial Intelligence, Chief Data Officer, Etl, Experience Data, Sql |
| 1 | 2 | 50 | Azure, Data Engineering, End Data Integration, Enterprise Data Platforms, Warehousing Cloud |
| 2 | 3 | 33 | Data Engineer New, Data Quality, Insights Ready Decisions, Main Focus Data, Visualization Servic... |
| 3 | 4 | 17 | Data Platform, Engineering Challenges, Platform Infrastructure, Platform Wide Design, Vision Data |
| 4 | 5 | 15 | Advances Ai Data, Ai Data Computational, Future Science Access, Roche Advances Ai, Sciences Drug |
| 5 | 6 | 15 | Code Ai Idea, Figma Platform, Figma Platform Ideas, Figma S Platform, Platform Ideas |
| 6 | 7 | 13 | Cloud Solutions, Cloud Solutions Oracle, Oci, Products Industries, World Leader Cloud |
| 7 | 8 | 3 | Hadoop, Title Data, Title Data Engineer |
| 8 | 9 | 3 | Cnbc, Nbc, Usa |
| 9 | 10 | 3 | A Senior Data Engineer, Senior Data, Senior Data Engineer |
H = island_graph_nlp
pos = nx.spring_layout(H, seed=42, k=0.8)
nodes = list(H.nodes())
node_x = [pos[n][0] for n in nodes]
node_y = [pos[n][1] for n in nodes]
node_sizes = [12 + 30 * centrality_nlp.get(n, 0) for n in nodes]
communities_cat = pd.Categorical([community_map_nlp.get(n, 0) for n in nodes])
fig = go.Figure()
for u, v, data in H.edges(data=True):
fig.add_trace(
go.Scatter(
x=[pos[u][0], pos[v][0]],
y=[pos[u][1], pos[v][1]],
mode="lines",
line=dict(width=max(1, data.get("weight", 1) * 0.15), color="#bbbbbb"),
hoverinfo="skip",
showlegend=False,
)
)
fig.add_trace(
go.Scatter(
x=node_x,
y=node_y,
mode="markers+text",
#text=[H.nodes[n]["label"] for n in nodes],
#textposition="bottom center",
marker=dict(
size=node_sizes,
color=communities_cat.codes,
colorscale="Plasma",
line=dict(width=1, color="#333"),
),
hovertext=[
f"{H.nodes[n]['label']}<br>Community: {community_map_nlp.get(n)}<br>Degree: {H.degree(n)}"
for n in nodes
],
hoverinfo="text",
showlegend=False,
)
)
fig.update_layout(
title="NLP Skill Islands (communities on weighted projection)",
xaxis=dict(visible=False),
yaxis=dict(visible=False),
margin=dict(l=20, r=20, t=60, b=20),
)
fig.show()
print(top_centrality.head(5).to_markdown())
print(top_centrality.iloc[20:24].to_markdown())
| | skill_node | skill_name | category | centrality | pipeline |
|---:|:---------------------------|:---------------------|:---------------------------------------------|-------------:|:-----------|
| 0 | skill_python | Python | Programming_Tools | 0.811024 | Dictionary |
| 1 | skill_sql | Sql | Programming_Tools | 0.80315 | Dictionary |
| 2 | skill_communication_skills | Communication Skills | Soft_Skills_Misc | 0.661417 | Dictionary |
| 3 | skill_aws | Aws | MLOps_DevOps_Cloud | 0.598425 | Dictionary |
| 4 | skill_machine_learning | Machine Learning | Machine_Learning_and_Artificial_Intelligence | 0.590551 | Dictionary |

| | skill_node | skill_name | category | centrality | pipeline |
|---:|:-----------------------|:-----------------|:--------------|-------------:|:-----------|
| 20 | skill_python | Python | NLP Cluster 4 | 0.613375 | NLP |
| 21 | skill_sql | Sql | NLP Cluster 4 | 0.589342 | NLP |
| 22 | skill_data_engineer | Data Engineer | NLP Cluster 4 | 0.584117 | NLP |
| 23 | skill_computer_science | Computer Science | NLP Cluster 4 | 0.569488 | NLP |
fig = px.bar(
top_centrality,
x="skill_name",
y="centrality",
color="pipeline",
barmode="group",
title="Top Skill Centrality by Pipeline",
labels={"skill_name": "Skill", "centrality": "Degree Centrality"},
)
fig.update_layout(
xaxis_tickangle=45,
xaxis_tickmode="linear",
xaxis_categoryorder="total descending",
yaxis=dict(title="Degree Centrality", rangemode="tozero"),
legend_title="Pipeline",
)
fig.show()
Project Summary
We evaluated Alfred job postings through two job-to-skill pipelines, building bipartite graphs, modularity-based skill islands, and centrality comparisons to pinpoint the core skill cohorts job seekers should prioritize in a targeted search.
The NLP-driven skill extraction uncovers a broader skill set and denser connections, but it is also more prone to noise without careful phrase curation. The dictionary-based approach offers precision but may miss emerging or nuanced skills. Combining both methods could yield a comprehensive skill map for job seekers.
The differences in community structure highlight how skill relationships vary with the extraction method, which can inform Alfred's job-matching algorithms and user guidance. While the job descriptions in this dataset are specific to Alfred's postings, the methodology generalizes to other job boards or industries with appropriate adjustments to the skill dictionary and NLP model. One improvement would be using a small LLM to help rationalize the extracted skills; indeed, several job boards already appear to be shipping similar functionality, such as AI-powered job-description analyzers that suggest skills to add to a resume. Even there, significant room for improvement remains, as many of these tools still miss key skills and nuances due to minor textual differences and the complexity of human language.
The project was successful in demonstrating the power of graph analysis for job seekers. For instance, I now realize that sharpening my SQL and Python skills, along with big-data tools like Spark and Hadoop, will make my resume more appealing to recruiters in the data engineering field.